[Chapter Sixteen][Previous]
[Next] [Art of
Assembly][Randall Hyde]
Art of Assembly: Chapter Sixteen
- 16.2 - The UCR Standard Library Pattern
Matching Routines
- 16.3 - The Standard Library Pattern Matching
Functions
- 16.3.1 - Spancset
- 16.3.2 - Brkcset
- 16.3.3 - Anycset
- 16.3.4 - Notanycset
- 16.3.5 - MatchStr
- 16.3.6 - MatchiStr
- 16.3.7 - MatchToStr
- 16.3.8 - MatchChar
- 16.3.9 - MatchToChar
- 16.3.10 - MatchChars
- 16.3.11 - MatchToPat
- 16.3.12 - EOS
- 16.3.13 - ARB
- 16.3.14 - ARBNUM
- 16.3.15 - Skip
- 16.3.16 - Pos
- 16.3.17 - RPos
- 16.3.18 - GotoPos
- 16.3.19 - RGotoPos
- 16.3.20 - SL_Match2
16.2 The UCR Standard Library Pattern Matching Routines
The UCR Standard Library provides a very sophisticated set of pattern
matching routines. They are patterned after the pattern matching facilities
of SNOBOL4, support CFGs, and provide fully automatic backtracking, as necessary.
Furthermore, by writing only five assembly language statements, you can
match simple or complex patterns.
There is very little assembly language code to worry about when using the
Standard Library's pattern matching routines because most of the work occurs
in the data segment. To use the pattern matching routines, you first construct
a pattern data structure in the data segment. You then pass the address
of this pattern and the string you wish to test to the Standard Library
match
routine. The match
routine returns failure
or success depending on the state of the comparison. This isn't quite as
easy as it sounds, though; learning how to construct the pattern data structure
is almost like learning a new programming language. Fortunately, if you've
followed the discussion on context free languages, learning this new "language"
is a breeze.
The Standard Library pattern data structure takes the following form:
Pattern struct
MatchFunction dword ?
MatchParm dword ?
MatchAlt dword ?
NextPattern dword ?
EndPattern word ?
StartPattern word ?
StrSeg word ?
Pattern ends
The MatchFunction
field contains the address of a routine to
call to perform some sort of comparison. The success or failure of this
function determines whether the pattern matches the input string. For example,
the UCR Standard Library provides a MatchStr
function that
compares the next n characters of the input string against some other character
string.
The MatchParm
field contains the address or value of a parameter
(if appropriate) for the MatchFunction
routine. For example,
if the MatchFunction
routine is MatchStr
, then
the MatchParm
field contains the address of the string to compare
the input characters against. Likewise, the MatchChar
routine
compares the next input character in the string against the L.O. byte of
the MatchParm
field. Some matching functions do not require
any parameters, they will ignore any value you assign to MatchParm
field. By convention, most programmers store a zero in unused fields of
the Pattern
structure.
The MatchAlt
field contains either zero (NULL) or the address
of some other pattern data structure. If the current pattern matches the
input characters, the pattern matching routines ignore this field. However,
if the current pattern fails to match the input string, then the pattern
matching routines will attempt to match the pattern whose address appears
in this field. If this alternate pattern returns success, then the pattern
matching routine returns success to the caller, otherwise it returns failure.
If the MatchAlt
field contains NULL, then the pattern matching
routine immediately fails if the main pattern does not match.
The Pattern
data structure only matches one item. For example,
it might match a single character, a single string, or a character from
a set of characters. A real world pattern will probably contain several
small patterns concatenated together, e.g., the pattern for a Pascal identifier
consists of a single character from the set of alphabetic characters followed
by one or more characters from the set [a-zA-Z0-9_]. The NextPattern
field lets you create a composite pattern as the concatenation of two individual
patterns. For such a composite pattern to return success, the current pattern
must match and then the pattern specified by the NextPattern
field must also match. Note that you can chain as many patterns together
as you please using this field.
The last three fields, EndPattern
, StartPattern
,
and StrSeg
are for the internal use of the pattern matching
routine. You should not modify or examine these fields.
Once you create a pattern, it is very easy to test a string to see if it
matches that pattern. The calling sequence for the UCR Standard Library
match
routine is
lesi << Input string to match »
ldxi << Pattern to match string against »
mov cx, 0
match
jc Success
The Standard Library match
routine expects a pointer to the
input string in the es:di
registers; it expects a pointer to
the pattern you want to match in the dx:si
register pair. The
cx
register should contain the length of the string you want
to test. If cx
contains zero, the match routine will test the
entire input string. If cx
contains a nonzero value, the match
routine will only test the first cx
characters in the string.
Note that the end of the string (the zero terminating byte) must not appear
in the string before the position specified in cx
. For most
applications, loading cx
with zero before calling match is
the most appropriate operation.
On return from the match
routine, the carry flag denotes success
or failure. If the carry flag is set, the pattern matches the string; if
the carry flag is clear, the pattern does not match the string. Unlike the
examples given in earlier sections, the match
routine does
not modify the di
register, even if the match succeeds. Instead,
it returns the failure/success position in the ax
register.
The is the position of the first character after the match if match
succeeds, it is the position of the first unmatched character if match
fails.
16.3 The Standard Library Pattern Matching Functions
The UCR Standard Library provides about 20 built-in pattern matching
functions. These functions are based on the pattern matching facilities
provided by the SNOBOL4 programming language, so they are very powerful
indeed! You will probably discover that these routines solve all your pattern
matching need, although it is easy to write your own pattern matching routines
if an appropriate one is not available. The following subsections describe
each of these pattern matching routines in detail.
There are two things you should note if you're using the Standard Library's
SHELL.ASM file when creating programs that use pattern matching and character
sets. First, there is a line at the very beginning of the SHELL.ASM file
that contains the statement "matchfuncs". This line is currently
a comment because it contains a semicolon in column one. If you are going
to be using the pattern matching facilities of the UCR Standard Library,
you need to uncomment this line by deleting the semicolon in column one.
If you are going to be using the character set facilities of the UCR Standard
Library (very common when using the pattern matching facilities), you may
want to uncomment the line containing "include stdsets.a" in the
data segment. The "stdsets.a" file includes several common character
sets, including alphabetics, digits, alphanumerics, whitespace, and so on.
16.3.1 Spancset
The spancset
routine skips over all characters belonging
to a character set. This routine will match zero or more characters in the
specified set and, therefore, always succeeds. The MatchParm
field of the pattern data structure must point at a UCR Standard Library
character set variable.
Example:
SkipAlphas pattern {spancset, alpha}
.
.
.
lesi StringWAlphas
ldxi SkipAlphas
xor cx, cx
match
16.3.2 Brkcset
Brkcset
is the dual to spancset
- it matches
zero or more characters in the input string which are not members of a specified
character set. Another way of viewing brkcset
is that it will
match all characters in the input string up to a character in the specified
character set (or to the end of the string). The matchparm
field contains the address of the character set to match.
Example:
DoDigits pattern {brkcset, digits, 0, DoDigits2}
DoDigits2 pattern {spancset, digits}
.
.
.
lesi StringWDigits
ldxi DoDigits
xor cx, cx
match
jnc NoDigits
The code above matches any string that contains a string of one or more
digits somewhere in the string.
16.3.3 Anycset
Anycset
matches a single character in the input string
from a set of characters. The matchparm
field contains the
address of a character set variable. If the next character in the input
string is a member of this set, anycset
set accepts the string
and skips over than character. If the next input character is not a member
of that set, anycset
returns failure.
Example:
DoID pattern {anycset, alpha, 0, DoID2}
DoID2 pattern {spancset, alphanum}
.
.
.
lesi StringWID
ldxi DoID
xor cx, cx
match
jnc NoID
This code segment checks the string StringWID
to see if it
begins with an identifier specified by the regular expression [a-zA-Z][a-zA-Z0-9]*.
The first subpattern with anycset
makes sure there is an alphabetic
character at the beginning of the string (alpha
is the stdsets.a
set variable that has all the alphabetic characters as members). If the
string does not begin with an alphabetic, the DoID
pattern
fails. The second subpattern, DoID2
, skips over any following
alphanumeric characters using the spancset matching function. Note that
spancset
always succeeds.
The above code does not simply match a string that is an identifier; it
matches strings that begin with a valid identifier. For example, it would
match "ThisIsAnID" as well as "ThisIsAnID+SoIsThis - 5".
If you only want to match a single identifier and nothing else, you must
explicitly check for the end of string in your pattern.
16.3.4 Notanycset
Notanycset
provides the complement to anycset
- it matches a single character in the input string that is not a member
of a character set. The matchparm
field, as usual, contains
the address of the character set whose members must not appear as the next
character in the input string. If notanycset
successfully matches
a character (that is, the next input character is not in the designated
character set), the function skips the character and returns success; otherwise
it returns failure.
Example:
DoSpecial pattern {notanycset, digits, 0, DoSpecial2}
DoSpecial2 pattern {spancset, alphanum}
.
.
.
lesi StringWSpecial
ldxi DoSpecial
xor cx, cx
match
jnc NoSpecial
This code is similar to the DoID
pattern in the previous example.
It matches a string containing any character except a digit and then matches
a string of alphanumeric characters.
16.3.5 MatchStr
Matchstr
compares the next set of input characters against
a character string. The matchparm
field contains the address
of a zero terminated string to compare against. If matchstr
succeeds, it returns the carry set and skips over the characters it matched;
if it fails, it tries the alternate matching function or returns failure
if there is no alternate.
Example:
DoString pattern {matchstr, MyStr}
MyStr byte "Match this!",0
.
.
.
lesi String
ldxi DoString
xor cx, cx
match
jnc NotMatchThis
This sample code matches any string that begins with the characters "Match
This!"
16.3.6 MatchiStr
Matchistr
is like matchstr
insofar as it compares
the next several characters against a zero terminated string value. However,
matchistr
does a case insensitive comparison. During the comparison
it converts the characters in the input string to upper case before comparing
them to the characters that the matchparm field points at. Therefore, the
string pointed at by the matchparm
field must contain uppercase
wherever alphabetics appear. If the matchparm
string contains
any lower case characters, the matchistr
function will always
fail.
Example:
DoString pattern {matchistr, MyStr}
MyStr byte "MATCH THIS!",0
.
.
.
lesi String
ldxi DoString
xor cx, cx
match
jnc NotMatchThis
This example is identical to the one in the previous section except it will
match the characters "match this!" using any combination of upper
and lower case characters.
16.3.7 MatchToStr
Matchtostr
matches all characters in an input string up
to and including the characters specified by the matchparm
parameter. This routine succeeds if the specified string appears somewhere
in the input string, it fails if the string does not appear in the input
string. This pattern function is quite useful for locating a substring and
ignoring everything that came before the substring.
Example:
DoString pattern {matchtostr, MyStr}
MyStr byte "Match this!",0
.
.
.
lesi String
ldxi DoString
xor cx, cx
match
jnc NotMatchThis
Like the previous two examples, this code segment matches the string "Match
this!" However, it does not require that the input string (String
)
begin with "Match this!" Instead, it only requires that "Match
this!" appear somewhere in the string.
16.3.8 MatchChar
The matchchar
function matches a single character. The
matchparm
field's L.O. byte contains the character you want
to match. If the next character in the input string is that character, then
this function succeeds, otherwise it fails.
Example:
DoSpace pattern {matchchar, ' '}
.
.
.
lesi String
ldxi DoSpace
xor cx, cx
match
jnc NoSpace
This code segment matches any string that begins with a space. Keep in mind
that the match
routine only checks the prefix of a string.
If you wanted to see if the string contained only a space (rather than a
string that begins with a space), you would need to explicitly check for
an end of string after the space. Of course, it would be far more efficient
to use strcmp
rather than match
for this purpose!
Note that unlike matchstr
, you encode the character you want
to match directly into the matchparm
field. This lets you specify
the character you want to test directly in the pattern definition.
16.3.9 MatchToChar
Like matchtostr
, matchtochar
matches all characters
up to and including a character you specify. This is similar to brkcset
except you don't have to create a character set containing a single member
and brkcset
skips up to but not including the specified character(s).
Matchtochar
fails if it cannot find the specified character
in the input string.
Example:
DoToSpace pattern {matchtochar, ' '}
.
.
.
lesi String
ldxi DoSpace
xor cx, cx
match
jnc NoSpace
This call to match
will fail if there are no spaces left in
the input string. If there are, the call to matchtochar
will
skip over all characters up to, and including, the first space. This is
a useful pattern for skipping over words in a string.
16.3.10 MatchChars
Matchchars
skips zero or more occurrences of a singe character
in an input string. It is similar to spancset
except you can
specify a single character rather than an entire character set with a single
member. Like matchchar
, matchchars
expects a single
character in the L.O. byte of the matchparm
field. Since this
routine matches zero or more occurrences of that character, it always succeeds.
Example:
Skip2NextWord pattern {matchtochar, ' ', 0, SkipSpcs}
SkipSpcs pattern {matchchars, ' '}
.
.
.
lesi String
ldxi Skip2NextWord
xor cx, cx
match
jnc NoWord
The code segment skips to the beginning of the next word in a string. It
fails if there are no additional words in the string (i.e., the string contains
no spaces).
16.3.11 MatchToPat
Matchtopat
matches all characters in a string up to and
including the substring matched by some other pattern. This is one of the
two facilities the UCR Standard Library pattern matching routines provide
to allow the implementation of nonterminal function calls. This matching
function succeeds if it finds a string matching the specified pattern somewhere
on the line. If it succeeds, it skips the characters through the last character
matched by the pattern parameter. As you would expect, the matchparm
field contains the address of the pattern to match.
Example:
; Assume there is a pattern "expression" that matches arithmetic
; expressions. The following pattern determines if there is such an
; expression on the line followed by a semicolon.
FindExp pattern {matchtopat, expression, 0, MatchSemi}
MatchSemi pattern {matchchar, ';'}
.
.
.
lesi String
ldxi FindExp
xor cx, cx
match
jnc NoExp
16.3.12 EOS
The EOS
pattern matches the end of a string. This pattern,
which must obviously appear at the end of a pattern list if it appears at
all, checks for the zero terminating byte. Since the Standard Library routines
only match prefixes, you should stick this pattern at the end of a list
if you want to ensure that a pattern exactly matches a string with no left
over characters at the end. EOS
succeeds if it matches the
zero terminating byte, it fails otherwise.
Example:
SkipNumber pattern {anycset, digits, 0, SkipDigits}
SkipDigits pattern {spancset, digits, 0, EOSPat}
EOSPat pattern {EOS}
.
.
.
lesi String
ldxi SkipNumber
xor cx, cx
match
jnc NoNumber
The SkipNumber
pattern matches strings that contain only decimal
digits (from the start of the match to the end of the string). Note that
EOS
requires no parameters, not even a matchparm parameter.
16.3.13 ARB
ARB
matches any number of arbitrary characters. This pattern
matching function is equivalent to
*. Note that ARB
is a very inefficient routine to use. It works by assuming it can match
all remaining characters in the string and then tries to match the pattern
specified by the nextpattern
field. If the nextpattern
item fails, ARB
backs up one character and tries matching nextpattern
again. This continues until the pattern specified by nextpattern
succeeds or ARB
backs up to its initial starting position.
ARB
succeeds if the pattern specified by nextpattern
succeeds, it fails if it backs up to its initial starting position.
Given the enormous amount of backtracking that can occur with ARB
(especially on long strings), you should try to avoid using this pattern
if at all possible. The matchtostr
, matchtochar
,
and matchtopat
functions accomplish much of what ARB
accomplishes, but they work forward rather than backward in the source string
and may be more efficient. ARB
is useful mainly if you're sure
the following pattern appears late in the string you're matching or if the
string you want to match occurs several times and you want to match the
last occurrence (matchtostr
, matchtochar
, and
matchtopat
always match the first occurrence they find).
Example:
SkipNumber pattern {ARB,0,0,SkipDigit}
SkipDigit pattern {anycset, digits, 0, SkipDigits}
SkipDigits pattern {spancset, digits}
.
.
.
lesi String
ldxi SkipNumber
xor cx, cx
match
jnc NoNumber
This code example matches the last number that appears on an input line.
Note that ARB
does not use the matchparm
field,
so you should set it to zero by default.
16.3.14 ARBNUM
ARBNUM
matches an arbitrary number (zero or more) of patterns
that occur in the input string. If R represents some nonterminal
number (pattern matching function), then ARBNUM
(R
) is equivalent to the production ARBNUM
R ARBNUM |
.
The matchparm
field contains the address of the pattern that
ARBNUM
attempts to match.
Example:
SkipNumbers pattern {ARBNUM, SkipNumber}
SkipNumber pattern {anycset, digits, 0, SkipDigits}
SkipDigits pattern {spancset, digits, 0, EndDigits}
EndDigits pattern {matchchars, ' ', EndString}
EndString pattern {EOS}
.
.
.
lesi String
ldxi SkipNumbers
xor cx, cx
match
jnc IllegalNumbers
This code accepts the input string if it consists of a sequence of zero
or more numbers separated by spaces and terminated with the EOS
pattern. Note the use of the matchalt
field in the EndDigits
pattern to select EOS
rather than a space for the last number
in the string.
16.3.15 Skip
Skip
matches n arbitrary characters in the input string.
The matchparm
field is an integer value containing the number
of characters to skip. Although the matchparm
field is a double
word, this routine limits the number of characters you can skip to 16 bits
(65,535 characters); that is, n is the L.O. word of the matchparm
field. This should prove sufficient for most needs.
Skip
succeeds if there are at least n characters left in the
input string; it fails if there are fewer than n characters left in the
input string.
Example:
Skip1st6 pattern {skip, 6, 0, SkipNumber}
SkipNumber pattern {anycset, digits, 0, SkipDigits}
SkipDigits pattern {spancset, digits, 0, EndDigits}
EndDigits pattern {EOS}
.
.
.
lesi String
ldxi Skip1st6
xor cx, cx
match
jnc IllegalItem
This example matches a string containing six arbitrary characters followed
by one or more decimal digits and a zero terminating byte.
16.3.16 Pos
Pos
succeeds if the matching functions are currently at
the nth character in the string, where n is the value in the L.O. word of
the matchparm
field. Pos
fails if the matching
functions are not currently at position n in the string. Unlike the pattern
matching functions you've seen so far, pos
does not consume
any input characters. Note that the string starts out at position zero.
So when you use the pos
function, it succeeds if you've matched
n characters at that point.
Example:
SkipNumber pattern {anycset, digits, 0, SkipDigits}
SkipDigits pattern {spancset, digits, 0, EndDigits}
EndDigits pattern {pos, 4}
.
.
.
lesi String
ldxi SkipNumber
xor cx, cx
match
jnc IllegalItem
This code matches a string that begins with exactly 4 decimal digits.
16.3.17 RPos
Rpos
works quite a bit like the pos
function
except it succeeds if the current position is n character positions from
the end of the string. Like pos
, n is the L.O. 16 bits of the
matchparm
field. Also like pos
, rpos
does not consume any input characters.
Example:
SkipNumber pattern {anycset, digits, 0, SkipDigits}
SkipDigits pattern {spancset, digits, 0, EndDigits}
EndDigits pattern {rpos, 4}
.
.
.
lesi String
ldxi SkipNumber
xor cx, cx
match
jnc IllegalItem
This code matches any string that is all decimal digits except for the last
four characters of the string. The string must be at least five characters
long for the above pattern match to succeed.
16.3.18 GotoPos
Gotopos
skips over any characters in the string until it
reaches character position n in the string. This function fails if the pattern
is already beyond position n in the string. The L.O. word of the matchparm
field contains the value for n.
Example:
SkipNumber pattern {gotopos, 10, 0, MatchNmbr}
MatchNmbr pattern {anycset, digits, 0, SkipDigits}
SkipDigits pattern {spancset, digits, 0, EndDigits}
EndDigits pattern {rpos, 4}
.
.
.
lesi String
ldxi SkipNumber
xor cx, cx
match
jnc IllegalItem
This example code skips to position 10 in the string and attempts to match
a string of digits starting with the 11th character. This pattern succeeds
if the there are four characters remaining in the string after processing
all the digits.
16.3.19 RGotoPos
Rgotopos
works like gotopos
except it goes
to the position specified from the end of the string. Rgotopos
fails if the matching routines are already beyond position n from the end
of the string. As with gotopos
, the L.O. word of the matchparm
field contains the value for n.
Example:
SkipNumber pattern {rgotopos, 10, 0, MatchNmbr}
MatchNmbr pattern {anycset, digits, 0, SkipDigits}
SkipDigits pattern {spancset, digits}
.
.
.
lesi String
ldxi SkipNumber
xor cx, cx
match
jnc IllegalItem
This example skips to ten characters from the end of the string and then
attempts to match one or digits starting at that point. It fails if there
aren't at least 11 characters in the string or the last 10 characters don't
begin with a string of one or more digits.
16.3.20 SL_Match2
The sl_match2
routine is nothing more than a recursive
call to match. The matchparm
field contains the address of
pattern to match. This is quite useful for simulating parenthesis around
a pattern in a pattern expression. As far as matching strings are concerned,
pattern1
and pattern2
, below, are equivalent:
Pattern2 pattern {sl_match2, Pattern1}
Pattern1 pattern {matchchar, 'a'}
The only difference between invoking a pattern directly and invoking it
with sl_match2
is that sl_match2
tweaks some internal
variables to keep track of matching positions within the input string. Later,
you can extract the character string matched by sl_match2
using
the patgrab
routine.
- 16.2 - The UCR Standard Library
Pattern Matching Routines
- 16.3 - The Standard Library Pattern Matching
Functions
- 16.3.1 - Spancset
- 16.3.2 - Brkcset
- 16.3.3 - Anycset
- 16.3.4 - Notanycset
- 16.3.5 - MatchStr
- 16.3.6 - MatchiStr
- 16.3.7 - MatchToStr
- 16.3.8 - MatchChar
- 16.3.9 - MatchToChar
- 16.3.10 - MatchChars
- 16.3.11 - MatchToPat
- 16.3.12 - EOS
- 16.3.13 - ARB
- 16.3.14 - ARBNUM
- 16.3.15 - Skip
- 16.3.16 - Pos
- 16.3.17 - RPos
- 16.3.18 - GotoPos
- 16.3.19 - RGotoPos
- 16.3.20 - SL_Match2
Art of Assembly: Chapter Sixteen - 29 SEP 1996
[Chapter Sixteen][Previous]
[Next] [Art of
Assembly][Randall Hyde]